Credit Card Default Predictor (DSCI Group Project Proposal)¶
Introduction¶
Defaulting on a credit card payment occurs when a client fails to pay their credit card balance, causing the bank or lender to lose money. The probability that a client will default therefore has a significant impact on whether a bank should approve a credit card application, as well as on the credit limit granted.
This project aims to build a classification model that predicts whether an account will default on its next credit card payment. The dataset was downloaded from Kaggle.
This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005. [1]
I will be using education, payment_ratio, and average_pay_status (the last two predictors created from PAY_1 through PAY_6, BILL_AMT2 through BILL_AMT6, and PAY_AMT1 through PAY_AMT5), with default.payment.next.month as the value we are predicting, for reasons explained in the data summarization and visualization steps.
Legend¶
Table 1, Credit Card Data
Table 2, Cleaned Credit Card Data
Table 3, Training Credit Card Data
Table 4, Number of Default Next Month
Table 5, Sex and Percent of Default
Table 6, Marriage and Percent of Default
Table 7, Education and Percent of Default
Figure 1, Credit Limit and Percent of Default
Figure 2, Age and Percent of Default
Figure 3, Pay Status and Percent Default
Figure 4, Pay Amount vs Bill Amount by month, colored by default next month
Figure 5, Pay Amount vs Bill Amount, colored by default next month
Table 8, Credit Card Training and Payment Ratio Calculation
Figure 6, Average Payment Ratio and Count of Default Next Month
Table 9, Credit Card Training after Average Pay Status Calculation
Figure 7, Average Pay Status and Count of Default Next Month
Table 10, Credit Card Training after Education Column Cast
Figure 8, Estimated Accuracy vs Number of Neighbors
Table 11, Credit Card Test with Prediction
Table 12, Prediction Accuracy Table
Table 13, Credit Card Confusion Matrix
Figure 9, Pay Status vs Pay Ratio, colored by Actual and Predicted Default Next Month
Figure 10, Education vs Pay Ratio, colored by Actual and Predicted Default Next Month
Figure 11, Education vs Pay Status, colored by Actual and Predicted Default Next Month
Methods and Results¶
Library Import¶
I used suppressPackageStartupMessages to keep the output clean
suppressPackageStartupMessages(library(tidyverse))
library(repr)
suppressPackageStartupMessages(library(tidymodels))
options(repr.matrix.max.rows = 6)
library(ggplot2)
suppressPackageStartupMessages(require(gridExtra))
set.seed(9999)
Data Import¶
data_url <- url("https://raw.githubusercontent.com/mlool/dsci-100-2023W1-group-008-31/main/data/UCI_Credit_Card.csv")
credit_card_data <- read_csv(data_url, show_col_types = FALSE) # again, to keep the output clean
credit_card_data
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ⋯ | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ⋯ | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ⋯ | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ⋯ | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 29998 | 30000 | 1 | 2 | 2 | 37 | 4 | 3 | 2 | -1 | ⋯ | 20878 | 20582 | 19357 | 0 | 0 | 22000 | 4200 | 2000 | 3100 | 1 |
| 29999 | 80000 | 1 | 3 | 1 | 41 | 1 | -1 | 0 | 0 | ⋯ | 52774 | 11855 | 48944 | 85900 | 3409 | 1178 | 1926 | 52964 | 1804 | 1 |
| 30000 | 50000 | 1 | 2 | 1 | 46 | 0 | 0 | 0 | 0 | ⋯ | 36535 | 32428 | 15313 | 2078 | 1800 | 1430 | 1000 | 1000 | 1000 | 1 |
Table 1, Credit Card Data
Cleaning and Wrangling¶
First, I wish to remove all rows containing a value outside the possible values specified by the source (e.g., a 10 in a column that should only contain 0 to 5). I chose to do this before splitting: the invalid entries will be removed regardless, and since they make up the majority of the data (only about 4,000 of the 30,000 rows are valid), splitting first and then filtering could leave very few valid entries in the test set.
category_colnames <- c("SEX", "EDUCATION", "MARRIAGE", "default.payment.next.month")
sex_categories <- c(1, 2)
education_categories <- c(1, 2, 3, 4, 5, 6)
marriage_status <- c(1, 2, 3)
pay_status <- c(-1, 1, 2, 3, 4, 5, 6, 7, 8, 9)
credit_card_tidy <- credit_card_data |>
rename(PAY_1 = PAY_0) |> # since the other PAY columns are numbered starting from 1, not 0
mutate(across(all_of(category_colnames), ~as_factor(.x))) |>
filter(SEX %in% sex_categories,
EDUCATION %in% education_categories,
MARRIAGE %in% marriage_status,
PAY_1 %in% pay_status,
PAY_2 %in% pay_status,
PAY_3 %in% pay_status,
PAY_4 %in% pay_status,
PAY_5 %in% pay_status,
PAY_6 %in% pay_status)
credit_card_tidy
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ⋯ | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 12 | 260000 | 2 | 1 | 2 | 51 | -1 | -1 | -1 | -1 | ⋯ | 8517 | 22287 | 13668 | 21818 | 9966 | 8583 | 22301 | 0 | 3640 | 0 |
| 22 | 120000 | 2 | 2 | 1 | 39 | -1 | -1 | -1 | -1 | ⋯ | 0 | 632 | 316 | 316 | 316 | 0 | 632 | 316 | 0 | 1 |
| 29 | 50000 | 2 | 3 | 1 | 47 | -1 | -1 | -1 | -1 | ⋯ | 2040 | 30430 | 257 | 3415 | 3421 | 2044 | 30430 | 257 | 0 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 29977 | 40000 | 1 | 2 | 2 | 47 | 2 | 2 | 3 | 2 | ⋯ | 51259 | 47151 | 46934 | 4000 | 0 | 2000 | 0 | 3520 | 0 | 1 |
| 29992 | 210000 | 1 | 2 | 1 | 34 | 3 | 2 | 2 | 2 | ⋯ | 2500 | 2500 | 2500 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 29995 | 80000 | 1 | 2 | 2 | 34 | 2 | 2 | 2 | 2 | ⋯ | 77519 | 82607 | 81158 | 7000 | 3500 | 0 | 7000 | 0 | 4000 | 1 |
Table 2, Cleaned Credit Card Data
Then after removing the invalid entries we split the data to perform analysis on only the training set to avoid violating the golden rule of machine learning.
credit_card_split <- initial_split(credit_card_tidy, prop = 0.75, strata = default.payment.next.month)
credit_card_training <- training(credit_card_split)
credit_card_testing <- testing(credit_card_split)
credit_card_training
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ⋯ | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | ⋯ | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 12 | 260000 | 2 | 1 | 2 | 51 | -1 | -1 | -1 | -1 | ⋯ | 8517 | 22287 | 13668 | 21818 | 9966 | 8583 | 22301 | 0 | 3640 | 0 |
| 29 | 50000 | 2 | 3 | 1 | 47 | -1 | -1 | -1 | -1 | ⋯ | 2040 | 30430 | 257 | 3415 | 3421 | 2044 | 30430 | 257 | 0 | 0 |
| 31 | 230000 | 2 | 1 | 2 | 27 | -1 | -1 | -1 | -1 | ⋯ | 15339 | 14307 | 36923 | 17270 | 13281 | 15339 | 14307 | 37292 | 0 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 29977 | 40000 | 1 | 2 | 2 | 47 | 2 | 2 | 3 | 2 | ⋯ | 51259 | 47151 | 46934 | 4000 | 0 | 2000 | 0 | 3520 | 0 | 1 |
| 29992 | 210000 | 1 | 2 | 1 | 34 | 3 | 2 | 2 | 2 | ⋯ | 2500 | 2500 | 2500 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 29995 | 80000 | 1 | 2 | 2 | 34 | 2 | 2 | 2 | 2 | ⋯ | 77519 | 82607 | 81158 | 7000 | 3500 | 0 | 7000 | 0 | 4000 | 1 |
Table 3, Training Credit Card Data
Data Summarization¶
# Number of observation in each class
training_count <- credit_card_training |>
group_by(default.payment.next.month) |>
summarize(count = n())
training_count
| default.payment.next.month | count |
|---|---|
| <fct> | <int> |
| 0 | 1958 |
| 1 | 1077 |
Table 4, Number of Default Next Month
Observation: I notice there are noticeably more non-defaults than defaults (roughly a 65/35 split), leading me to think we should evaluate our model with precision/recall rather than accuracy alone.
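For context, the counts in Table 4 imply a majority-class baseline (always predicting no default) of roughly 65% accuracy, which any model we build should beat:

```r
# majority-class baseline computed from the training counts above
baseline_accuracy <- 1958 / (1958 + 1077)
baseline_accuracy # roughly 0.645
```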
training_sex_count <- credit_card_training |>
group_by(SEX, default.payment.next.month) |>
summarize(count = n()) |>
group_by(SEX) |>
summarize(percent_of_default = sum(count[default.payment.next.month == 1]) / sum(count)) |>
mutate(SEX = case_when(SEX == 1 ~ "male", SEX == 2 ~ "female"))
training_sex_count
`summarise()` has grouped output by 'SEX'. You can override using the `.groups` argument.
| SEX | percent_of_default |
|---|---|
| <chr> | <dbl> |
| male | 0.3837580 |
| female | 0.3344576 |
Table 5, Sex and Percent of Default
Observation: I notice that males default slightly more often than females, but the difference is small, so sex might not be an important factor.
training_marriage_count <- credit_card_training |>
group_by(MARRIAGE, default.payment.next.month) |>
summarize(count = n()) |>
group_by(MARRIAGE) |>
summarize(percent_of_default = sum(count[default.payment.next.month == 1]) / sum(count)) |>
mutate(MARRIAGE = case_when(MARRIAGE == 1 ~ "married", MARRIAGE == 2 ~ "single", MARRIAGE == 3 ~ "other"))
training_marriage_count
`summarise()` has grouped output by 'MARRIAGE'. You can override using the `.groups` argument.
| MARRIAGE | percent_of_default |
|---|---|
| <chr> | <dbl> |
| married | 0.3527919 |
| single | 0.3542977 |
| other | 0.5000000 |
Table 6, Marriage and Percent of Default
Observation: We notice that there is barely any difference between married and single (and the "other" category likely has too few observations to trust), so we should omit this column from our model.
training_education_count <- credit_card_training |>
group_by(EDUCATION, default.payment.next.month) |>
summarize(count = n()) |>
group_by(EDUCATION) |>
summarize(percent_of_default = sum(count[default.payment.next.month == 1]) / sum(count)) |>
mutate(EDUCATION = case_when(EDUCATION == 1 ~ "grad",
EDUCATION == 2 ~ "university",
EDUCATION == 3 ~ "high school",
EDUCATION == 4 ~ "others",
EDUCATION == 5 ~ "unknown",
EDUCATION == 6 ~ "unknown",
))
print(training_education_count)
`summarise()` has grouped output by 'EDUCATION'. You can override using the `.groups` argument.
| EDUCATION | percent_of_default |
|---|---|
| <chr> | <dbl> |
| grad | 0.278 |
| university | 0.416 |
| high school | 0.413 |
| others | 0 |
| unknown | 0 |
| unknown | 0 |
Table 7, Education and Percent of Default
Observation: I notice that graduate school has a lower default rate than university and high school. The difference between university and high school is small, but the gap to graduate school is quite large, so education might be an important factor.
Data Visualization¶
options(repr.plot.width = 5, repr.plot.height = 5)
limit_bal_plot <- credit_card_training |>
ggplot(aes(x = LIMIT_BAL, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "Credit Limit", y = "percent default") +
theme(legend.position="bottom")
limit_bal_plot
Figure 1, Credit Limit and Percent of Default
Observation: Although the default ratio generally trends downward as the credit limit increases, it climbs back up and becomes erratic at high limits, probably because of the limited sample size there. I therefore do not think we should use this column as a predictor, due to the lack of data across the board.
age_plot <- credit_card_training |>
ggplot(aes(x = AGE, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "age", y = "percent default") +
theme(legend.position="bottom")
age_plot
Figure 2, Age and Percent of Default
Observation: I see that there is no clear trend between age and defaults, leading me to think that we do not need this predictor either
options(repr.plot.width = 12, repr.plot.height = 10)
pay_1_plot <- credit_card_training |>
ggplot(aes(x = PAY_1, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "pay status (Sep 2005)", y = "percent default") +
theme(legend.position="bottom")
pay_2_plot <- credit_card_training |>
ggplot(aes(x = PAY_2, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "pay status (Aug 2005)", y = "") +
theme(legend.position="bottom")
pay_3_plot <- credit_card_training |>
ggplot(aes(x = PAY_3, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "pay status (Jul 2005)", y = "") +
theme(legend.position="bottom")
pay_4_plot <- credit_card_training |>
ggplot(aes(x = PAY_4, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "pay status (Jun 2005)", y = "percent default") +
theme(legend.position="bottom")
pay_5_plot <- credit_card_training |>
ggplot(aes(x = PAY_5, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "pay status (May 2005)", y = "") +
theme(legend.position="bottom")
pay_6_plot <- credit_card_training |>
ggplot(aes(x = PAY_6, fill = default.payment.next.month)) +
geom_bar(stat = "count", position = "fill") +
labs(x = "pay status (Apr 2005)") +
theme(legend.position="bottom")
grid.arrange(pay_1_plot, pay_2_plot, pay_3_plot, pay_4_plot, pay_5_plot, pay_6_plot, nrow = 2, ncol = 3)
Figure 3, Pay Status and Percent Default
Observation: We see that those who do not pay duly tend to default more than those who do; the difference is quite significant, so this might be an important factor in our model.
# Offset the x and y since you pay for the previous month
plot_1 <- ggplot(credit_card_training, aes(x = BILL_AMT2, y = PAY_AMT1, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Bill Amount (Aug 2005)",
y = "Pay Amount (Sep 2005)",
color="Default Payment Next Month (1=yes, 0=no)") +
theme(legend.position="bottom")
plot_2 <- ggplot(credit_card_training, aes(x = BILL_AMT3, y = PAY_AMT2, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Bill Amount (Jul 2005)",
y = "Pay Amount (Aug 2005)",
color="Default Payment Next Month (1=yes, 0=no)") +
theme(legend.position="bottom")
plot_3 <- ggplot(credit_card_training, aes(x = BILL_AMT4, y = PAY_AMT3, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Bill Amount (Jun 2005)",
y = "Pay Amount (July 2005)",
color="Default Payment Next Month (1=yes, 0=no)") +
theme(legend.position="bottom")
plot_4 <- ggplot(credit_card_training, aes(x = BILL_AMT5, y = PAY_AMT4, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Bill Amount (May 2005)",
y = "Pay Amount (Jun 2005)",
color="Default Payment Next Month (1=yes, 0=no)") +
theme(legend.position="bottom")
plot_5 <- ggplot(credit_card_training, aes(x = BILL_AMT6, y = PAY_AMT5, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Bill Amount (Apr 2005)",
y = "Pay Amount (May 2005)",
color="Default Payment Next Month (1=yes, 0=no)") +
theme(legend.position="bottom")
grid.arrange(plot_1, plot_2, plot_3, plot_4, plot_5, nrow = 3, ncol = 2)
Figure 4, Pay Amount vs Bill Amount by month, colored by default next month
Observation: We can see there are patterns; to incorporate the comment made on the proposal, I have made a summary plot with all data points and will address the findings there.
options(repr.plot.width = 8, repr.plot.height = 7)
pay_bill_training <- credit_card_training |>
select(BILL_AMT2:PAY_AMT5, default.payment.next.month)
bill_amount <- tibble()
pay_amount <- tibble()
default_next_month <- tibble()
bill_column_list <- c("BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6")
pay_column_list <- c("PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5")
for (i in 1:5) {
selected_bill_amount <- pay_bill_training[bill_column_list[i]] |>
rename(bill_amount = bill_column_list[i])
selected_pay_amount <- pay_bill_training[pay_column_list[i]] |>
rename(pay_amount = pay_column_list[i])
bill_amount <- bind_rows(bill_amount, selected_bill_amount)
pay_amount <- bind_rows(pay_amount, selected_pay_amount)
default_next_month <- bind_rows(default_next_month, credit_card_training["default.payment.next.month"])
}
result_tibble <- tibble(bill_amount, pay_amount, default_next_month)
pay_bill_plot <- ggplot(result_tibble, aes(x = bill_amount, y = pay_amount, color = default.payment.next.month)) +
geom_point(alpha=0.2) +
labs(x = "Bill Amount",
y = "Pay Amount",
color="Default Payment Next Month (1=yes, 0=no)") +
theme(legend.position="bottom")
pay_bill_plot
Figure 5, Pay Amount vs Bill Amount, colored by default next month
Observation: I notice that those who pay the full amount of the previous bill tend not to default (the straight line), while those who pay little tend to default (low pay amount). This would also be an important predictor in our model. Some feature engineering may be needed, though, because it is the relationship between the two columns that matters rather than either column on its own (e.g., given one client with a bill of 100,000 and another with a bill of 100, we cannot tell who is going to default unless we also see how much each paid).
Feature Engineering¶
Payment Billing Ratio¶
I wish to create a function that converts the BILL_AMT and PAY_AMT columns into a repayment ratio by averaging the percentage of repayment over the months with a positive bill; more precisely,
$$result = \max\left(\min\left(\frac{1}{n}\sum_{i=1}^{n} \frac{\text{Pay Amount for Month } i}{\text{Bill Amount for Month } i},\ 1\right),\ 0\right)$$
where $n$ is the number of months with a positive bill. I bound the result between 0 and 1 at the end rather than per month, since a client may pay more in one month to cover a missed payment, or less to offset an overpayment from the previous month, and I wish to let the average absorb that possibility.
find_ratio <- function(bill, pay) {
    ratio_total <- 0
    valid_month <- 0
    for (i in seq_along(pay)) {
        if (!is.na(bill[i]) && bill[i] > 0) {
            ratio_total <- ratio_total + (pay[i] / bill[i])
            valid_month <- valid_month + 1
        }
    }
    # bound between 0 and 1 to avoid skewed outliers; a payment ratio typically lies in [0, 1]
    return(max(min(ratio_total / valid_month, 1), 0))
}
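As a quick sanity check, here is the behaviour of find_ratio on made-up toy vectors (the function is repeated so this chunk runs on its own):

```r
# find_ratio repeated from above so this chunk is self-contained
find_ratio <- function(bill, pay) {
    ratio_total <- 0
    valid_month <- 0
    for (i in seq_along(pay)) {
        if (!is.na(bill[i]) && bill[i] > 0) {
            ratio_total <- ratio_total + (pay[i] / bill[i])
            valid_month <- valid_month + 1
        }
    }
    return(max(min(ratio_total / valid_month, 1), 0))
}

# months with a zero bill are skipped: (50/100 + 100/200) / 2 = 0.5
find_ratio(c(100, 200, 0), c(50, 100, 10))
# overpayment is capped at the end: 250/100 = 2.5, bounded to 1
find_ratio(c(100), c(250))
```

The cap keeps a one-off catch-up payment from inflating the average repayment ratio.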
credit_card_training <- credit_card_training |>
rowwise() |>
mutate(PAYMENT_RATIO = find_ratio(c_across(BILL_AMT2:BILL_AMT6), c_across(PAY_AMT1:PAY_AMT5))) |>
select(-c(BILL_AMT1:PAY_AMT6))
credit_card_training
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | default.payment.next.month | PAYMENT_RATIO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <fct> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <dbl> |
| 12 | 260000 | 2 | 1 | 2 | 51 | -1 | -1 | -1 | -1 | -1 | 2 | 0 | 0.8030414 |
| 29 | 50000 | 2 | 3 | 1 | 47 | -1 | -1 | -1 | -1 | -1 | -1 | 0 | 1.0000000 |
| 31 | 230000 | 2 | 1 | 2 | 27 | -1 | -1 | -1 | -1 | -1 | -1 | 0 | 1.0000000 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 29977 | 40000 | 1 | 2 | 2 | 47 | 2 | 2 | 3 | 2 | 2 | 2 | 1 | 0.03737737 |
| 29992 | 210000 | 1 | 2 | 1 | 34 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 0.00000000 |
| 29995 | 80000 | 1 | 2 | 2 | 34 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 0.04378178 |
Table 8, Credit Card Training and Payment Ratio Calculation
updated_ratio_plot <- credit_card_training |>
ggplot(aes(x = PAYMENT_RATIO, fill = default.payment.next.month)) +
geom_histogram(binwidth = 0.05) +
labs(x = "Average Payment Ratio", fill = "default next month")
updated_ratio_plot
Figure 6, Average Payment Ratio and Count of Default Next Month
Observation: We notice that at a ratio near 1 (full repayment), most clients do not default next month, while the closer the ratio gets to 0, the higher the chance of default. This new column should therefore be a good predictor for our model.
Pay Status¶
After the feature engineering above, we have 8 predictors, 6 of which are PAY columns. Since the L2 norm (Euclidean distance) is a sum of squares in which every predictor carries equal weight, the PAY columns would most likely have too much influence on the resulting model, so I keep only the average pay status instead of all six. Note that pay status used to be categorical with discrete values; taking the average makes it numerical.
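To make the equal-weight concern concrete, here is a small made-up sketch of how six columns on a shared scale dominate the squared Euclidean distance relative to a single other predictor:

```r
# two observations differing by one (standardized) unit in every column:
# six PAY columns plus one other predictor
a <- rep(0, 7)
b <- rep(1, 7)
pay_part <- sum((b[1:6] - a[1:6])^2)  # the six PAY columns contribute 6
other_part <- (b[7] - a[7])^2         # the single other column contributes 1
```

The PAY columns account for six sevenths of the squared distance, regardless of how informative the other predictor is.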
find_average <- function(pay) {
    return(mean(pay))
}
credit_card_training <- credit_card_training |>
rowwise() |>
mutate(AVERAGE_PAY_STATUS = find_average(c_across(PAY_1:PAY_6))) |>
select(-c(PAY_1:PAY_6))
credit_card_training
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | default.payment.next.month | PAYMENT_RATIO | AVERAGE_PAY_STATUS |
|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <fct> | <fct> | <dbl> | <fct> | <dbl> | <dbl> |
| 12 | 260000 | 2 | 1 | 2 | 51 | 0 | 0.8030414 | -0.5 |
| 29 | 50000 | 2 | 3 | 1 | 47 | 0 | 1.0000000 | -1.0 |
| 31 | 230000 | 2 | 1 | 2 | 27 | 0 | 1.0000000 | -1.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 29977 | 40000 | 1 | 2 | 2 | 47 | 1 | 0.03737737 | 2.166667 |
| 29992 | 210000 | 1 | 2 | 1 | 34 | 1 | 0.00000000 | 2.166667 |
| 29995 | 80000 | 1 | 2 | 2 | 34 | 1 | 0.04378178 | 2.000000 |
Table 9, Credit Card Training after Average Pay Status Calculation
updated_pay_plot <- credit_card_training |>
ggplot(aes(x = AVERAGE_PAY_STATUS, fill = default.payment.next.month)) +
geom_histogram(binwidth = 0.5) +
labs(x = "Average Pay Status", fill = "default next month")
updated_pay_plot
Figure 7, Average Pay Status and Count of Default Next Month
Observation: We notice that when the average pay status is at or below roughly 0, more people pay on time than default, and when it is above 0, more tend to default, which is what we expect. We can therefore use this cleaned-up column as a predictor.
Education¶
Let's convert the column into integers so we can apply scaling and centering to it
credit_card_training <- credit_card_training |> mutate(EDUCATION = as.integer(EDUCATION))
credit_card_training
| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | default.payment.next.month | PAYMENT_RATIO | AVERAGE_PAY_STATUS |
|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <fct> | <int> | <fct> | <dbl> | <fct> | <dbl> | <dbl> |
| 12 | 260000 | 2 | 2 | 2 | 51 | 0 | 0.8030414 | -0.5 |
| 29 | 50000 | 2 | 4 | 1 | 47 | 0 | 1.0000000 | -1.0 |
| 31 | 230000 | 2 | 2 | 2 | 27 | 0 | 1.0000000 | -1.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 29977 | 40000 | 1 | 3 | 2 | 47 | 1 | 0.03737737 | 2.166667 |
| 29992 | 210000 | 1 | 3 | 1 | 34 | 1 | 0.00000000 | 2.166667 |
| 29995 | 80000 | 1 | 3 | 2 | 34 | 1 | 0.04378178 | 2.000000 |
Table 10, Credit Card Training after Education Column Cast
Preprocessing of Test Dataset¶
Since the functions operate entirely per observation, we can run them on the test dataset without issue. Note that I will not print the output, to avoid looking at the test dataset in any way.
credit_card_testing <- credit_card_testing |>
rowwise() |>
mutate(PAYMENT_RATIO = find_ratio(c_across(BILL_AMT2:BILL_AMT6), c_across(PAY_AMT1:PAY_AMT5))) |>
mutate(AVERAGE_PAY_STATUS = find_average(c_across(PAY_1:PAY_6))) |>
mutate(EDUCATION = as.integer(EDUCATION)) |>
select(-c(PAY_1:PAY_6), -c(BILL_AMT1:PAY_AMT6))
Predictors Chosen¶
From the preliminary data analysis, I have decided to keep EDUCATION, PAYMENT_RATIO, and AVERAGE_PAY_STATUS as predictors.
Data Analysis¶
We will be using k-nearest neighbours (k-NN) classification to create the prediction model for this problem.
Recipe Creation¶
Other than selecting the chosen predictors, we will also scale and center the data. Payment ratio and average pay status are now numerical thanks to the feature engineering; although education is technically categorical, we can still treat it as numerical since there is an order between the categories: the lower the number, the higher the education level.
credit_card_recipe <- recipe(default.payment.next.month ~ EDUCATION + PAYMENT_RATIO + AVERAGE_PAY_STATUS,
data = credit_card_training) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
Hyperparameter Tuning¶
We will be using 5-fold cross-validation on the training set
k_vals <- tibble(neighbors = seq(from = 1, to = 200, by = 5))
credit_vfold <- vfold_cv(credit_card_training, v = 5, strata = default.payment.next.month)
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
knn_result <- workflow() |>
add_recipe(credit_card_recipe) |>
add_model(knn_spec) |>
tune_grid(resamples = credit_vfold, grid = k_vals) |>
collect_metrics()
accuracy_vs_k <- knn_result |>
filter(.metric == "accuracy") |>
ggplot(aes(x = neighbors, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Neighbors", y = "Accuracy Estimate")
accuracy_vs_k
Figure 8, Estimated Accuracy vs Number of Neighbors
Observation: We notice that after reaching around 75 neighbors, the accuracy plateaus: increasing the number of neighbors barely changes the estimated accuracy. I would have liked to test much larger neighbor counts to see where underfitting begins, but those models would take too long to run; at least within this interval, underfitting does not appear to occur. I will therefore choose 75 as our neighbor count.
Running the Model¶
We now run the model on the test dataset
knn_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 75) |>
set_engine("kknn") |>
set_mode("classification")
credit_card_fit <- workflow() |>
add_recipe(credit_card_recipe) |>
add_model(knn_best_spec) |>
fit(data = credit_card_training)
credit_card_predictions <- predict(credit_card_fit, credit_card_testing) |>
bind_cols(credit_card_testing)
credit_card_predictions
| .pred_class | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | default.payment.next.month | PAYMENT_RATIO | AVERAGE_PAY_STATUS |
|---|---|---|---|---|---|---|---|---|---|
| <fct> | <dbl> | <dbl> | <fct> | <int> | <fct> | <dbl> | <fct> | <dbl> | <dbl> |
| 0 | 22 | 120000 | 2 | 3 | 1 | 39 | 1 | 1 | -1 |
| 0 | 117 | 240000 | 1 | 2 | 2 | 28 | 0 | 1 | -1 |
| 0 | 119 | 400000 | 1 | 3 | 1 | 34 | 0 | 1 | -1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 29710 | 250000 | 1 | 2 | 2 | 33 | 1 | 0.95498445 | 0.1666667 |
| 1 | 29802 | 260000 | 1 | 2 | 2 | 33 | 0 | 0.05482904 | 2.0000000 |
| 0 | 29866 | 150000 | 1 | 2 | 1 | 43 | 1 | 0.80000000 | 0.1666667 |
Table 11, Credit Card Test with Prediction
credit_card_accuracy <- credit_card_predictions |>
metrics(truth = default.payment.next.month, estimate = .pred_class)
credit_card_accuracy
| .metric | .estimator | .estimate |
|---|---|---|
| <chr> | <chr> | <dbl> |
| accuracy | binary | 0.7845850 |
| kap | binary | 0.5222673 |
Table 12, Prediction Accuracy Table
Observation: We notice that the test accuracy is quite similar to the cross-validation accuracy estimated on the training set, indicating that we did not overfit the training data.
credit_card_mat <- credit_card_predictions |>
conf_mat(truth = default.payment.next.month, estimate = .pred_class)
credit_card_mat
Truth
Prediction 0 1
0 556 121
1 97 238
Table 13, Credit Card Confusion Matrix
Observation: From this we find $$precision = \frac{238}{97 + 238} = 0.710$$ $$recall = \frac{238}{121 + 238} = 0.663$$ Although both are above 0.5, indicating the model does better than pure guessing, they are lower than what I would consider a good model. This might be because there are more non-defaults than defaults, and because the hyperparameter tuning was based on accuracy instead of precision/recall.
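The two scores can be reproduced directly from the confusion matrix in Table 13, treating default (1) as the positive class:

```r
# counts taken from the confusion matrix above
tp <- 238  # predicted 1, truth 1
fp <- 97   # predicted 1, truth 0
fn <- 121  # predicted 0, truth 1
precision <- tp / (tp + fp)  # about 0.710
recall <- tp / (tp + fn)     # about 0.663
```

If I am not mistaken, yardstick's precision() and recall() functions (with event_level = "second", since 1 is the second factor level) would compute the same values from credit_card_predictions.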
Visualization¶
options(repr.plot.width = 12, repr.plot.height = 5)
plot_1 <- ggplot(credit_card_predictions, aes(x = PAYMENT_RATIO, y = AVERAGE_PAY_STATUS, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Payment Ratio",
y = "Average Pay Status",
color="Default Payment Next Month (1=yes, 0=no)") +
ggtitle("Pay Status vs Payment Ratio Plot (Actual)") +
theme(legend.position="bottom")
plot_2 <- ggplot(credit_card_predictions, aes(x = PAYMENT_RATIO, y = AVERAGE_PAY_STATUS, color = .pred_class )) +
geom_point(alpha=0.5) +
labs(x = "Payment Ratio",
y = "Average Pay Status",
color="Predicted Default Payment Next Month (1=yes, 0=no)") +
ggtitle("Pay Status vs Payment Ratio Plot (Predicted)") +
theme(legend.position="bottom")
grid.arrange(plot_1, plot_2, nrow = 1, ncol = 2)
Figure 9, Pay Status vs Pay Ratio, colored by Actual and Predicted Default Next Month
Observation: We notice that a high average pay status and a low payment ratio correlate with a high chance of default, which the model captured quite well.
plot_1 <- ggplot(credit_card_predictions, aes(x = PAYMENT_RATIO, y = EDUCATION, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Payment Ratio",
y = "Education",
color="Default Payment Next Month (1=yes, 0=no)") +
ggtitle("Education vs Payment Ratio Plot (Actual)") +
theme(legend.position="bottom")
plot_2 <- ggplot(credit_card_predictions, aes(x = PAYMENT_RATIO, y = EDUCATION, color = .pred_class )) +
geom_point(alpha=0.5) +
labs(x = "Payment Ratio",
y = "Education",
color="Predicted Default Payment Next Month (1=yes, 0=no)") +
ggtitle("Education vs Payment Ratio Plot (Predicted)") +
theme(legend.position="bottom")
grid.arrange(plot_1, plot_2, nrow = 1, ncol = 2)
Figure 10, Education vs Pay Ratio, colored by Actual and Predicted Default Next Month
Observation: We notice that there does not seem to be a significant trend between education and payment ratio, but our model did capture that the right side (high payment ratio) contains more non-defaults than defaults.
plot_1 <- ggplot(credit_card_predictions, aes(x = AVERAGE_PAY_STATUS, y = EDUCATION, color = default.payment.next.month )) +
geom_point(alpha=0.5) +
labs(x = "Average Pay Status",
y = "Education",
color="Default Payment Next Month (1=yes, 0=no)") +
ggtitle("Education vs Pay Status Plot (Actual)") +
theme(legend.position="bottom")
plot_2 <- ggplot(credit_card_predictions, aes(x = AVERAGE_PAY_STATUS, y = EDUCATION, color = .pred_class )) +
geom_point(alpha=0.5) +
labs(x = "Average Pay Status",
y = "Education",
color="Predicted Default Payment Next Month (1=yes, 0=no)") +
ggtitle("Education vs Pay Status Plot (Predicted)") +
theme(legend.position="bottom")
grid.arrange(plot_1, plot_2, nrow = 1, ncol = 2)
Figure 11, Education vs Pay Status, colored by Actual and Predicted Default Next Month
Observation: Similar to the observation above, there is no significant trend between pay status and education. The model did, however, capture that a higher pay status usually corresponds to a higher likelihood of defaulting next month.
Summary¶
In the model, we get an accuracy of 0.784, a precision of 0.710, and a recall of 0.663. Although this beats what a baseline model would produce (always guessing default, or always guessing non-default), in my opinion it is still an unsatisfactory score for such an important decision in credit lending, especially given the even lower recall, as it is probably more important for a lender to catch a client who will default than one who will not.
I have multiple hypotheses for the low score despite the apparent patterns seen in the exploratory data analysis and feature engineering, but the main reason, I believe, is the choice of model.
For one, the dataset contains a majority of non-defaults, so under a k-NN model the neighbors in the grey zone will tend to be biased toward non-defaults, which is consistent with the precision being higher than the recall. This agrees with the investigation by Alam et al. (2020), who also found that class imbalance hurt model performance, although under a different prediction model than k-NN. We might be able to fix this by resampling the training data to forcibly enforce equal class sizes.
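One way the resampling fix could look is sketched below; this assumes the themis package (a tidymodels extension providing step_upsample), and I have not verified it on this dataset:

```r
# Hypothetical sketch: upsample the minority class before scaling/centering,
# assuming the themis package is installed
library(themis)
balanced_recipe <- recipe(default.payment.next.month ~ EDUCATION + PAYMENT_RATIO + AVERAGE_PAY_STATUS,
                          data = credit_card_training) |>
    step_upsample(default.payment.next.month, over_ratio = 1) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
```

The rest of the workflow (cross-validation, tuning, fitting) would stay the same, with this recipe swapped in.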
Another reason might be that under the L2 norm used by k-NN classification, all predictors receive equal weight, so weaker indicators can cloud the result. In this specific data, it might be that education, a weak indicator, clouds the judgment of payment ratio and pay status, the two stronger indicators created by feature engineering. This leads me to think we should rescale the predictors to introduce different weights.
If I had more time (and for the sake of the course) I would have chosen a different model. A simple alternative is logistic regression, where I would not need to hand-weight the scaled predictors: the model learns the importance of each predictor, and I could inspect the fitted coefficients to see how much each one matters.
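A hypothetical sketch of that alternative, reusing the recipe defined earlier (untested on this dataset):

```r
# logistic regression via parsnip's glm engine, swapped into the same workflow
logreg_spec <- logistic_reg() |>
    set_engine("glm") |>
    set_mode("classification")
logreg_fit <- workflow() |>
    add_recipe(credit_card_recipe) |>
    add_model(logreg_spec) |>
    fit(data = credit_card_training)
# tidy(extract_fit_parsnip(logreg_fit)) would then list one coefficient per predictor
```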
But in my opinion, the best model for this dataset among those I am familiar with would be a basic neural network, where the weights learned in each layer would, again, rebalance the influence of each predictor. This appears to be what Chen & Zhang (2021) used in their research, although in a much more complex form than what I would consider a basic model, and they report a significant improvement in accuracy.
All in all, such a model is still important for credit entities, since the decision to extend credit and loans depends heavily on whether the recipient will default. However, we would want a higher-accuracy model, as a low-accuracy one might cause significant losses to the lender as well as unnecessary problems for applicants. Future questions include how we can improve the model, and whether there are hidden patterns of significance within the discarded predictors.
References¶
Alam, T. M., Shaukat, K., Hameed, I. A., Luo, S., Sarwar, M. U., Shabbir, S., Li, J., & Khushi, M. (2020). An investigation of credit card default prediction in the imbalanced datasets. IEEE Access, 8, 201173–201198. https://doi.org/10.1109/access.2020.3033784
Chen, Y., & Zhang, R. (2021). Research on credit card default prediction based on K-means SMOTE and BP neural network. Complexity, 2021, 1–13. https://doi.org/10.1155/2021/6618841